BERTopic vs CTM vs Top2Vec

Performance comparison of the proposed BERTopic topic model with two baselines: CTM and Top2Vec
Created on September 2 | Last edited on September 3

Topic Coherence Score

Topic coherence quantifies how well a set of facts hangs together. For example, “What clubs do you belong to?”, “Who organises your leisure activities?” and “How many times a week do you participate in sports clubs?” would be considered a coherent set of facts, as they all cluster around the topic of clubs and leisure.
Topic coherence can be calculated using a variety of measures. In these experiments, normalized pointwise mutual information (NPMI) was used, which is the popular pointwise mutual information metric normalized to lie between -1 and 1. An extremely incoherent set of topics would score -1 and an extremely coherent set of topics would score 1, although in practice topic coherence values are closer to 0. The higher the topic coherence, the better.
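As a rough illustration of how an NPMI-based coherence score can be computed, the sketch below estimates word probabilities from document-level co-occurrence and averages NPMI over all word pairs in a topic. The function name npmi_coherence, the toy documents and the handling of edge cases are assumptions made for illustration; the experiments in this report may have used a different implementation (for example, one based on sliding windows).

```python
import math
from itertools import combinations


def npmi_coherence(topic_words, documents):
    """Mean NPMI over all word pairs of one topic.

    Probabilities are estimated as the fraction of documents that
    contain a word (or both words of a pair). This is an illustrative
    choice, not necessarily the estimator used in the report.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents containing every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            # The words never co-occur: NPMI takes its minimum value.
            scores.append(-1.0)
        elif p12 == 1:
            # Both words occur in every document; PMI is 0 and the
            # normalizer is 0, so 0.0 is used here by convention.
            scores.append(0.0)
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))

    if not scores:
        raise ValueError("a topic needs at least two words")
    return sum(scores) / len(scores)


# Toy corpus: each document is a list of tokens.
docs = [
    ["club", "sport", "leisure"],
    ["club", "sport"],
    ["tax", "bank"],
    ["bank", "tax", "loan"],
]

coherent = npmi_coherence(["club", "sport"], docs)    # words always co-occur
incoherent = npmi_coherence(["club", "tax"], docs)    # words never co-occur
```

In this toy corpus the first topic reaches the maximum score and the second the minimum, which matches the -1 to 1 range described above. Real corpora produce values much closer to 0.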
In the diagram below, BERTopic outperforms CTM and Top2Vec for every number of topics tested, with Top2Vec being the worst performer.


[Figure: topic coherence by number of topics for BERTopic, CTM and Top2Vec (run set: 3)]


Topic Diversity Score

Topic diversity measures how varied the language of the topic model is across its topics. It matters alongside topic coherence because the topics should be not only coherent but also diverse in the range of language they use.
The higher the topic diversity, the better. However, topic coherence negatively correlates with topic diversity, so hyperparameters that increase one tend to decrease the other, and a balance between them must be found.
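One common way to score topic diversity is the proportion of unique words among the top words of all topics. The sketch below assumes that definition; the function name topic_diversity, the top_k parameter and the example topics are hypothetical, and the report does not state exactly which formulation was used.

```python
def topic_diversity(topics, top_k=10):
    """Share of unique words among the top_k words of every topic.

    1.0 means no word is shared between topics; values near 0 mean
    the topics largely repeat the same words.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("at least one non-empty topic is required")
    return len(set(top_words)) / len(top_words)


# Two toy topics that share the word "club".
example_topics = [
    ["club", "sport", "leisure"],
    ["club", "bank", "loan"],
]
diversity = topic_diversity(example_topics, top_k=3)  # 5 unique words out of 6
```

Under this definition, a model that repeats the same high-probability words in many topics is penalised even if each individual topic is coherent, which is one way the trade-off described above can arise.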
In the diagram below, BERTopic maintains very high topic diversity scores for every number of topics tested, with CTM being competitive when the number of topics is low. Top2Vec is again clearly the worst performer.

[Figure: topic diversity by number of topics for BERTopic, CTM and Top2Vec (run set: 3)]


Computation Time

Finally, this section compares the computation time per run, in seconds, for CTM, BERTopic and Top2Vec. It is important to note that CTM is 18 times slower than BERTopic: each BERTopic run takes approximately 25 seconds, whereas each CTM run takes 450 seconds. Top2Vec takes approximately 40 seconds per run.
Therefore, BERTopic is not only the best performer in terms of topic coherence and topic diversity but also consistently the fastest.
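The relative slowdowns follow directly from the approximate per-run times quoted above. The short sketch below reproduces that arithmetic; the variable names are illustrative, and the timings are the rounded figures from the text rather than raw measurements.

```python
# Approximate seconds per run, as quoted in the text.
run_seconds = {"BERTopic": 25, "Top2Vec": 40, "CTM": 450}

baseline = run_seconds["BERTopic"]

# How many times slower each model is than BERTopic.
slowdown = {model: secs / baseline for model, secs in run_seconds.items()}

# Model with the shortest time per run.
fastest = min(run_seconds, key=run_seconds.get)

for model, factor in sorted(slowdown.items(), key=lambda item: item[1]):
    print(f"{model}: {run_seconds[model]} s per run, {factor:.1f}x BERTopic")
```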

[Figure: computation time per run in seconds for CTM, BERTopic and Top2Vec (run set: 3)]